Combining Process Replication and Checkpointing for Resilience on Exascale Systems

Authors

  • Henri Casanova
  • Yves Robert
  • Frédéric Vivien
  • Dounia Zaidouni
Abstract

Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback, has recently been advocated in the literature. We first derive novel theoretical results for exponential failure distributions, namely exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption. We then extend these results to arbitrary failure distributions, obtaining closed-form solutions for Weibull distributions. Finally, we evaluate process replication in simulation using both synthetic and real-world failure traces, identifying scenarios in which process replication is beneficial. We also find that although the choice of the checkpointing period can have a high impact on application execution in the no-replication case, this choice is no longer critical when process replication is used.

Key-words: Fault-tolerance, parallel computing, checkpoint/restart, process replication

Affiliations: ∗ University of Hawai’i at Mānoa, USA; † LIP, Ecole Normale Supérieure de Lyon, France; ‡ University of Tennessee Knoxville, USA; § Institut Universitaire de France; ¶ INRIA, Lyon, France

hal-00697180, version 2 - 12 Mar 2013

French title: Utilisation conjointe de la réplication et de la prise de points de sauvegarde pour la résilience sur plates-formes exascales
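As a rough illustration of what the Mean Number of Failures To Interruption measures, the following Python sketch estimates it by Monte Carlo simulation. It is not the paper's derivation. It assumes that each process has exactly two replicas, that every failure strikes a still-running processor chosen uniformly at random (as would hold for independent, identically distributed exponential failures), and that no processor is repaired before the interruption. The function names `failures_to_interruption` and `estimate_mnfti` are hypothetical.

```python
import random


def failures_to_interruption(n_pairs, rng):
    """Count failures until both replicas of some pair have failed.

    Assumption of this sketch: each failure hits one live processor,
    chosen uniformly at random, and nothing is repaired in between.
    """
    intact = n_pairs   # pairs with both replicas still running
    degraded = 0       # pairs with exactly one replica still running
    failures = 0
    while True:
        failures += 1
        live = 2 * intact + degraded
        # The failure interrupts the application if it hits the last
        # surviving replica of a degraded pair.
        if rng.random() < degraded / live:
            return failures
        # Otherwise it hit one replica of an intact pair.
        intact -= 1
        degraded += 1


def estimate_mnfti(n_pairs, trials=20000, seed=0):
    """Average the failure count over many simulated runs."""
    rng = random.Random(seed)
    total = sum(failures_to_interruption(n_pairs, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (1, 2, 16, 1024):
        print(n, estimate_mnfti(n))
```

With a single replicated pair the count is always exactly 2, since the application is only interrupted by the second failure. With two pairs the expected value under these assumptions is 8/3, which the simulation should approach.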

Similar resources

Toward Exascale Resilience

Over the past few years resilience has become a major issue for HPC systems, in particular in the perspective of large Petascale systems and future Exascale ones. These systems will typically gather from half a million to several million CPU cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that Exascale system...

Optimal Checkpointing Period: Time vs. Energy

This short paper deals with parallel scientific applications using non-blocking and periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs to be made by instantiating the model with a set of realistic ...

Using group replication for resilience on exascale systems

High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft studied research question is that of the o...

Using replication for resilience on exascale systems

High performance computing applications must be tolerant to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-rollback, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft studied research question is that of the opt...

Toward Exascale Resilience: 2014 Update

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process cras...

Publication date: 2012